Application of the Double Metaphone Algorithm to Amharic Orthography

Author

  • Daniel Yacob
Abstract

The Metaphone algorithm applies phonetic encoding to orthographic sequences to simplify words prior to comparison. While Metaphone has been highly successful for the English language, for which it was designed, it cannot be applied directly to Ethiopian languages. The paper details how the principles of Metaphone can be applied to Ethiopic script, using Amharic as a case study. Match results improve as specific considerations are made for Amharic writing practices. Results are shown to improve further when common errors from Amharic input methods are considered.

Introduction

In the field of text analysis the concept of word distance is of central importance. Word distance is the essence of word comparison, which appears in innumerable problems of text processing. Text searching and document spell checking are but two familiar examples. The notion of distance between groups of written symbols may at first seem highly abstract, since we are trained to think of distance as intrinsically spatial. In the natural world we know distance as a geometric relation between two tangible objects, measured with an equally tangible instrument. Word distances are unitless and determined by the number of corrections required, character by character, to transform one word into another. Each type of correction is given a (likewise unitless) “weight” value whose magnitude may reflect the severity of the discrepancy. The sum of these values then determines the overall magnitude of the word distance. In its naive form all symbols are treated equally: a set of words of equal length with completely unlike phonemic sequences will appear equally distant from a comparison word. To improve upon the distance distribution, linguistic and orthographical information may be applied as a stage of the distance analysis.
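As a rough illustration of the word distance described above, the following Python sketch sums per-correction weights with a standard dynamic-programming edit distance. The function name `word_distance`, the parameters `sub_cost` and `indel_cost`, and the vowel-based weighting are assumptions made for this example only; they are not taken from the paper.

```python
def word_distance(a, b, sub_cost=lambda x, y: 1.0, indel_cost=1.0):
    """Weighted edit distance: the sum of the weights of the corrections
    (insertions, deletions, substitutions) needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cost = 0.0 if ca == cb else sub_cost(ca, cb)
            cur.append(min(prev[j] + indel_cost,      # delete ca
                           cur[j - 1] + indel_cost,   # insert cb
                           prev[j - 1] + cost))       # substitute ca -> cb
        prev = cur
    return prev[-1]


# Naive form: every symbol is treated equally, so unrelated words of the
# same length are all equally distant from the comparison word.
naive_dog = word_distance("cat", "dog")
naive_pig = word_distance("cat", "pig")

# Hypothetical weighting: swapping one vowel for another is a milder error.
def vowel_aware(x, y):
    return 0.5 if x in "aeiou" and y in "aeiou" else 1.0

weighted = word_distance("cat", "cot", sub_cost=vowel_aware)
```

With uniform weights, "dog" and "pig" are the same distance from "cat". The weighted variant shows how linguistic information can pull near-homophones closer together.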
Most commonly, linguistic analysis is applied to precondition words so as to improve their apparent proximity to the comparison word. The preconditioning approach is particularly useful; its aim is to rectify the disparity between the phonology of the spoken word and the orthography of the written word. For example, vowel clusters would be represented by a single symbol. This paper presents an investigation into the application to Amharic orthography of a very successful preconditioning technique, the Metaphone algorithm, used for English and other European languages. Before presenting the details of the technique itself, it is useful to first review the nature of the problem to which it will be applied.

Problems in Amharic Spelling

Amharic orthography reflects spoken phonetic features to a large extent, so closely that one can be led to believe that there is no notion of “spelling” in Amharic. The rule generally followed is “if a word sounds right when read aloud then it was rightly written”. Upon closer inspection we quickly realize that Amharic spelling rules are simply very forgiving when compared to the strict, albeit irregular, conventions of English. Compared to the English author, the Amharic author is to a degree liberated to put more cognitive energy into the quality of the writing and less into how words must be written. Though its rules are far fewer and require less conscious attention, Amharic spelling very clearly has rules. For example, some phonetic spelling variations are more acceptable than others. While few would give pause to “ውኃ” vs “ዉሀ”, or even “ታህሳስ” vs “ታኅሣሥ”, the rendering “ዓዲሥ አበባ ኢትዮጵያ”, while phonetically valid, is not an acceptable replacement for “አዲስ አበባ ኢትዮጵያ”. Though we will explore a number of other categories of Amharic misspellings, spelling error correction may be viewed in large part as a standardization effort.
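The preconditioning idea can be sketched for the spelling variants mentioned above. The Python example below folds Ethiopic letter series that are pronounced alike in Amharic onto a single series before comparison. The table `HOMOPHONE_SERIES` and the function `normalize` are hypothetical names for this illustration. The sketch is a simple series fold, not the paper's Metaphone encoding, and it relies on the Unicode Ethiopic block placing each series at a multiple of eight, with the seven vowel orders at offsets 0 to 6.

```python
# Hypothetical table: first code point of a homophonic series -> the series
# it is folded onto. Only series that sound alike in Amharic are listed.
HOMOPHONE_SERIES = {
    0x1210: 0x1200,  # ሐ series -> ሀ series
    0x1280: 0x1200,  # ኀ series -> ሀ series
    0x1220: 0x1230,  # ሠ series -> ሰ series
    0x12D0: 0x12A0,  # ዐ series -> አ series
    0x1340: 0x1338,  # ፀ series -> ጸ series
}


def normalize(word):
    """Replace each homophonic syllable by its counterpart in the kept
    series, preserving the vowel order of the syllable."""
    out = []
    for ch in word:
        cp = ord(ch)
        base, order = cp & ~0x7, cp & 0x7
        # Offset 7 is left untouched: it does not line up across all series.
        if base in HOMOPHONE_SERIES and order < 7:
            cp = HOMOPHONE_SERIES[base] + order
        out.append(chr(cp))
    return "".join(out)


same = normalize("ታኅሣሥ") == normalize("ታህሳስ")
```

Under these assumptions the two renderings “ታኅሣሥ” and “ታህሳስ” normalize to the same string. A pair such as “ውኃ” and “ዉሀ” would still differ, because it also involves interchangeable vowel orders, which this sketch does not address.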
Levels of Amharic Spelling

In Ethiopian society there are acceptable levels of precision, a phono-orthographical radius within which renderings may fall and still be considered recognizable and acceptable. The sarcastic quip of Mark Twain: “I respect a man who knows how to spell a word more than one way”

Similar resources

Grapheme-to-Phoneme Conversion for Amharic Text-to-Speech System

Developing correct Grapheme-to-Phoneme (GTP) conversion method is a central problem in text-to-speech synthesis. Particularly, deriving phonological features which are not shown in orthography is challenging. In the Amharic language, geminates and epenthetic vowels are very crucial for proper pronunciation but neither is shown in orthography. This paper describes an architecture, a preprocessing...

Syllable-Based Speech Recognition for Amharic

Amharic is the Semitic language that has the second largest number of speakers after Arabic (Hayward and Richard 1999). Its writing system is syllabic with Consonant-Vowel (CV) syllable structure. Amharic orthography has more or less a one to one correspondence with syllabic sounds. We have used this feature of Amharic to develop a CV syllable-based speech recognizer, using Hidden Markov Modeling...

A Double Metaphone Encoding for Approximate Name Searching and Matching in Bangla

Almost any word can be a Bangali name, and the name in turn is often spelled in many different ways, all of which are considered correct and interchangeable. The reason for the spelling complication is two-fold: (1) there is a large gap between the script and pronunciation in Bangla, largely attributed to the large scale Sanskritization process that started in the 12th century and continued throu...

Morpho-syntactically Annotated Amharic Treebank

In this paper, we describe an ongoing project of developing a treebank for Amharic. The main objective of developing the treebank is to use it as an input for the development of a parser. Morphologically-rich Languages like Arabic, Amharic and other Semitic languages present challenges to the state-of-art in parsing. In such language morphemes play important functions in both morphology and syn...

High speed data retrieval from national data center (ndc) reducing time and ignoring spelling error in search key based on double metaphone algorithm

Fast and efficient data management is one of the demanding technologies of today’s aspect. This paper proposes a system which makes the working procedures of present manual system of storing and retrieving huge citizen’s information of Bangladesh automated and increases its effectiveness. The implemented search methodology is user friendly and efficient enough for high speed data retrieval igno...

Journal:
  • CoRR

Volume: cs.CL/0408052

Publication date: 2004